Topic discovery in massive text corpora based on Min-Hashing
Similar resources
Topic Based Analysis of Text Corpora
We present a framework that combines machine learnt classifiers and taxonomies of topics to enable a more conceptual analysis of a corpus than can be accomplished using Vector Space Models and Latent Dirichlet Allocation based topic models which represent documents purely in terms of words. Given a corpus and a taxonomy of topics, we learn a classifier per topic and annotate each document with ...
Sampled Weighted Min-Hashing for Large-Scale Topic Mining
We present Sampled Weighted Min-Hashing (SWMH), a randomized approach to automatically mine topics from large-scale corpora. SWMH generates multiple random partitions of the corpus vocabulary based on term cooccurrence and agglomerates highly overlapping inter-partition cells to produce the mined topics. While other approaches define a topic as a probabilistic distribution over a vocabulary, SW...
Automated Phrase Mining from Massive Text Corpora
As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus. Phrase mining is important in various tasks including automatic term recognition, document indexing, keyphrase extraction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of...
Discovery of Treatments from Text Corpora
An extensive literature in computational social science examines how features of messages, advertisements, and other corpora affect individuals’ decisions, but these analyses must specify the relevant features of the text before the experiment. Automated text analysis methods are able to discover features of text, but these methods cannot be used to obtain the estimates of causal effects—the qu...
Fuzzy Approach Topic Discovery in Health and Medical Corpora
The majority of medical documents and electronic health records (EHRs) are in text format, which poses a challenge for data processing and finding relevant documents. Looking for ways to automatically retrieve the enormous amount of health and medical knowledge has always been an intriguing topic. Powerful methods have been developed in recent years to make the text processing automatic. One of t...
Journal
Journal title: Expert Systems with Applications
Year: 2019
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2019.06.024